Нелинейное снижение размерности методом Isomap

isomap

Багаев Д.В.

Non-linear dimensionality reduction

Problem

High-dimensional data, meaning data that requires more than two or three dimensions to represent, can be difficult to interpret. One approach to simplification is to assume that the data of interest lie on an embedded non-linear manifold within the higher-dimensional space. If the manifold is of low enough dimension, the data can be visualised in the low-dimensional space.

Multi-dimensional scaling

Map $x\rightarrow y$ preserving distances as much as possible.

  • Approaches:
    • absolute difference $$ \sum_{i,j} (||x_i - x_j|| - ||y_i - y_j||)^2 \rightarrow \min_Y $$
    • relative difference (more attention to small distances) $$ \sum_{i,j} \frac{(||x_i - x_j|| - ||y_i - y_j||)^2}{||x_i - x_j||^2} \rightarrow \min_Y $$

Non-linear dimensionality reduction

  • Based on assumption that original data $x \in R^D $ is distributed compactly on non-linear surface with dimensionality $d < D$
  • Let $ y \in R^d $ denote the coordinates of $x$ on the surface
  • Sample dataset: isomap
  • Linear dimensionality reduction techniques will fail here
In [12]:
print("Typical datasets for dimensionality reduction evaluation")
typicalDatasets.show()
Typical datasets for dimensionality reduction evaluation

Comment : True datasets have much more dimensions, more complex structure, errors, outliers etc.

Analysis

Issue Issue: small $||x_i - x_j||$ should not always imply small $||y_i - y_j||$

Solution

Isomap : Map $x \Rightarrow y$ preserving correspondence beetwen distance in target space and geodesic distance along the surface in original space. Solution

Isomap

Algorithm : There are four stages to Isomap:

  1. Determine a neighbourhood graph $G$ of the observed data {$x_i$} in a suitable way. For example, $G$ might contain $x_ix_j$ iff $x_j$ is one of the $k$ nearest neighbours of $x_i$ (and vice versa). Alternatively, $G$ might contain the edge $x_ix_j$ iff $||x_i - x_j|| < \varepsilon $ for some $\varepsilon$.
  2. For each $x_i$ find its $K$ nearest neighbours.
  3. Compute shortes paths in the graph for all pairs of data points. Each edge $x_ix_j$ in the graph is weighted by its Euclidean length $|x_i - x_j|$, or by some other useful metric. Here Dijkstra or Floyd algorithms can be used.
  4. Apply MDS to the resulting shortest-path distance matrix D to find new embedding of the data in Euclidien space, apporimating $Y$.

Complexity

  • kNN
    • brute-force
      • M nearest neighbours $O(MN)$
    • kd-tree
      • building a kd-tree has O(N·logN) time complexity and O(K·N) space complexity
      • nearest neighbor search - close to O(logN)
      • M nearest neighbors - close to O(M·logN)
    • ball_tree etc
  • Shortest paths
    • Floyd - This procedure requires $O(N^3)$ operations. More efficient algorithms exploiting the sparse structure of the neighborhood graph can be found
    • Dijkstra - $O((|E| + |V|)log|V|)$, where $V$ - number of nodes, $E$ - number of edges
  • MDS
    • Classical MDS requires $O(D + dN^2)$

Issues of Isomap

  • Noisy observations beetwen distant parts of surfaces may make distant parts close
  • Solutions
    • remove observations with large total flows through them
    • remove nearest neighbours that vilate local linaerity
  • Selection of $K$
    • if too small, then poor approximation of geodesic distance
    • if too large, then increases chance of "short-circuiting" through noisy observations.

Implementations

  • Python
  • R
  • MatLab
  • GitHub
In [17]:
from sklearn.datasets import fetch_mldata
import numpy as np
mnist = fetch_mldata('MNIST original', data_home="./")
numberImages = []
for i in range(48):
    numberImages.append((mnist.data[1250 * i].reshape(28, 28), ''))
images = RowImages(numberImages, columns = 8)
images.show(cmap='gray_r')
In [19]:
from sklearn.manifold import Isomap
# use only 1/10 of the data: full dataset takes a long time!
data = mnist.data[::10]
target = mnist.target[::10]

model = Isomap(n_components=2)
proj = model.fit_transform(data)
plt.figure(figsize=(16, 8))
plt.scatter(proj[:, 0], proj[:, 1], c=target, cmap=plt.cm.get_cmap('jet', 10))
plt.colorbar(ticks=range(10))
plt.clim(-0.5, 9.5);
In [17]:
fig, ax = plt.subplots(1, 2, sharey=True, figsize=(16, 6))
for i in range(2):
    plots[i].plotWithImages(ax[i], cmap='gray_r', thumb_frac=0.08)

The result gives you an idea of the variety of forms that the number "1" can take within the dataset. The data lies along a broad curve in the projected space, which appears to trace the orientation of the digit. As you move up the plot, you find ones that have hats and/or bases, though these are very sparse within the dataset. The projection lets us identify outliers that have data issues: for example, pieces of the neighboring digits that snuck into the extracted images.

In [18]:
fig, ax = plt.subplots(1, 2, sharey=True, figsize=(16, 6))
for i in range(2):
    plots[i + 2].plotWithImages(ax[i], cmap='gray_r', thumb_frac=0.1)
In [20]:
otherFig, otherAx = plt.subplots(3, 2, sharey=True, figsize=(16, 10))
for i in range(3):
    for j in range(2):
        otherNumberPlots[i * 2 + j].plotWithImages(otherAx[i][j], cmap='gray_r', thumb_frac=0.08)
In [22]:
facesImages = []
for i in range(60):
    facesImages.append((faces.images[i], ''))
images = RowImages(facesImages, columns = 12)
images.show(cmap='gray')
In [24]:
figFaces, axFaces = plt.subplots(figsize=(16, 8))
facesPlot.plotWithImages(axFaces, images=faces.images[:, ::2, ::2])

The result is interesting: the first two Isomap dimensions seem to describe global image features: the overall darkness or lightness of the image from left to right, and the general orientation of the face from bottom to top. This gives us a nice visual indication of some of the fundamental features in our data.

In [26]:
putinModel = Isomap(n_components=2)
putinPlot = PlotComponents(putinData, putinModel)
figPutin, axPutin = plt.subplots(figsize=(16, 8))
putinImages = faces.images[faces.target==putinIndex]
putinPlot.plotWithImages(axPutin, images=putinImages, thumb_frac=0.0, reshape=())
In [20]:
from sklearn.datasets import fetch_olivetti_faces
olivettiDataset = fetch_olivetti_faces()

olivettiImages = []
for i in range(60):
    olivettiImages.append((olivettiDataset.images[i], ''))
images = RowImages(olivettiImages, columns = 12)
images.show(cmap='gray')

print(olivettiDataset.data.shape)
(400, 4096)
In [25]:
figOlivetti, axOlivetti = plt.subplots(figsize=(16, 8))
olivettiPlot.plotWithImages(axOlivetti, images=olivettiDataset.images)

Classification

Return to digits images. We have 70000 data points, each an 28x28 image -> 784 dimensional vector.

In [27]:
X = mnist.data[::30]
y = mnist.target[::30]
print(X.shape)
print(y.shape)
(2334, 784)
(2334,)

To evaluate the algorithm, split data into training and testing part

In [28]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print("X_train shape: %s" % repr(X_train.shape))
print("y_train shape: %s" % repr(y_train.shape))
print("X_test shape: %s" % repr(X_test.shape))
print("y_test shape: %s" % repr(y_test.shape))
X_train shape: (1750, 784)
y_train shape: (1750,)
X_test shape: (584, 784)
y_test shape: (584,)
In [29]:
isomapModel = Isomap(n_components=2, n_neighbors=5)
fittedIsomapModel = isomapModel.fit(X_train)
projection = fittedIsomapModel.transform(X_train)
In [30]:
plt.figure(figsize=(16, 8))
plt.scatter(projection[:, 0], projection[:, 1], c=y_train, cmap=plt.cm.get_cmap('jet', 10));
In [31]:
testProjection = fittedIsomapModel.transform(X_test)
In [32]:
plt.figure(figsize=(16, 8))
plt.scatter(testProjection[:, 0], testProjection[:, 1], c=y_test, cmap=plt.cm.get_cmap('jet', 10));
In [53]:
import math
def distance(p1, p2):
    return math.sqrt((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2)

success = 0
k = 9
for i in range(len(testProjection)):
    testPoint = testProjection[i]
    testPointLabel = y_test[i]
    #print("Point: ", testPoint, ", Label: ", testPointLabel)
    testMinsForPoint = list(enumerate(list(map(lambda p: distance(p, testPoint), projection))))
    kNeighbours = sorted(testMinsForPoint, key=lambda t: t[1])
    distances = list(map(lambda n: n[1], kNeighbours[:k]))
    indices = list(map(lambda n: n[0], kNeighbours[:k]))
    labels = list(map(lambda i: y_train[i], indices))
    m = {}
    for i, label in enumerate(labels):
        if (m.get(label) == None):
            m[label] = 1.0 / distances[i]
        else:
            m[label] += 1.0 / distances[i]
    if max(m, key=m.get) == testPointLabel:
        success += 1
    #print("Test: ", y_train[classificationIndex])
In [54]:
print("Success rate", float(success) / len(testProjection))
Success rate 0.559931506849315
In [35]:
isomapModel3 = Isomap(n_components=3, n_neighbors=5)
fittedIsomapModel3 = isomapModel3.fit(X_train)
projection3 = fittedIsomapModel3.transform(X_train)
testProjection3 = fittedIsomapModel3.transform(X_test)
In [36]:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(16, 8))
ax = fig.add_subplot(111, projection='3d')
_ = ax.scatter(projection3[:, 0], projection3[:, 1], projection3[:, 2], c=y_train, cmap=plt.cm.get_cmap('jet', 10))
In [37]:
fig = plt.figure(figsize=(16, 8))
ax = fig.add_subplot(111, projection='3d')
_ = ax.scatter(testProjection3[:, 0], testProjection3[:, 1], testProjection3[:, 2], c=y_test, cmap=plt.cm.get_cmap('jet', 10))
In [51]:
def distance3(p1, p2):
    return math.sqrt((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2 + (p1[2] - p2[2]) ** 2)

success3 = 0
k = 9
for i in range(len(testProjection3)):
    testPoint = testProjection3[i]
    testPointLabel = y_test[i]
    #print("Point: ", testPoint, ", Label: ", testPointLabel)
    testMinsForPoint = list(enumerate(list(map(lambda p: distance3(p, testPoint), projection3))))
    kNeighbours = sorted(testMinsForPoint, key=lambda t: t[1])
    indices = list(map(lambda n: n[0], kNeighbours[:k]))
    distances = list(map(lambda n: n[1], kNeighbours[:k]))
    labels = list(map(lambda i: y_train[i], indices))
    m = {}
    for i, label in enumerate(labels):
        if (m.get(label) == None):
            m[label] = 1.0 / distances[i]
        else:
            m[label] += 1.0 / distances[i]
    if max(m, key=m.get) == testPointLabel:
        success3 += 1
    #print("Test: ", y_train[classificationIndex])
In [52]:
print("Success rate for n_components=3: ", float(success3) / len(testProjection3))
Success rate for n_components=3:  0.6643835616438356

Thank you for your attention!

In [ ]: